ABSTRACT:

One important aspect of the relationship between spoken and written Chinese is the ranked
syllable-to-character mapping spectrum, which is the ranked list of syllables by the number of
characters that map to the syllable. Previously, this spectrum was analyzed for more than 400
syllables without distinguishing the four intonations. In the current study, the spectrum with
1280 toned syllables is analyzed by the logarithmic function, the Beta rank function, and the
piecewise logarithmic function. Of the three fitting functions, the two-piece logarithmic
function fits the data best, both by the smallest sum of squared errors (SSE) and by the lowest
Akaike information criterion (AIC) value. The Beta rank function is a close second. By sampling
from a Poisson distribution whose parameter value is chosen from the observed data, we
empirically estimate the p-value for testing the hypothesis that the two-piece logarithmic
function is better than the Beta rank function to be 0.16. For practical purposes, the piecewise
logarithmic function and the Beta rank function can be considered a tie.



1 Introduction 



The Chinese language has been considered to be one of the hardest languages to learn for
non-natives, or at least "strikingly different" to a European language speaker (Wang, 1973). For
spoken Chinese, a sound (syllable) consists of a consonant initial (e.g. h), a vowel final (e.g. ao),
and one of the four intonations (e.g., the dipping intonation 3: hao3). Although the syllables
are now written in the Roman/Latin alphabet under the pinyin system, and are pronounced
similarly to those of many Romance/Latin languages, the four distinct intonations (changes of
pitch contour) have no counterpart in Western languages. For written Chinese, the level
of degeneracy can be high. One-to-one correspondence between ideogram or character and
syllable is rare, such as the example of neng2 which has only one character and one meaning
of "can", "to be able to", while tens of characters per syllable is common. Besides these
complications, each character may have several meanings depending on the context.

The many-to-one mapping from a sound in spoken Chinese to characters in written Chinese
crucially contributes to the complexity of the Chinese language. The term "polymorphous" can
be used to describe the situation where many characters are mapped to one syllable. The terms
"homophone" (Sakuma et al., 1998; Su, 2002) and "heterograph" (Su, 2001; Chang and Zhang,
2003) used in linguistics describe similar situations, though these are used at the word level.
Homophone emphasizes the aspect of the same pronunciation, whereas heterograph emphasizes
the difference in writing.

In order to characterize numerically this polymorphous feature of Chinese language, we 
ask the following question: on average, how many written characters correspond to a toned 
syllable? The next quantitative linguistic question is: if the number of characters per syllable 
is ranked for all syllables (with intonation), termed "syllable-to-character mapping spectrum", 
does the decrease of this measurement with rank follow a particular functional form? 

There are around 400 syllables in spoken Chinese ignoring intonation (some syllables are
rare, and/or used only in a colloquial form). When intonation is considered, the number of
syllables should be multiplied by 4, giving a number of the order of 1600. The number of Chinese
characters included in a given dictionary, however, is not fixed. Besides a core group of commonly
used characters, which can be in the range of low thousands (e.g. ~3000 covered by primary school
textbook: http://zd.diyifanwen.com/zidian/szb/), the total number can be as high as 50,000
~ 60,000. This uncertainty of the total number of Chinese characters is reminiscent of the
uncertainty of the total number of English words described under the "large number of rare
event" model (Baayen, 2002).
Besides these, two more facts should be considered in counting the number of characters.
First, some characters are only used in classic literature while rarely used in modern Chinese.
Second, a set of simplified characters (roughly 2000) was introduced in 1964 in mainland China
(http://www.china-language.gov.cn/wenziguifan/managed/002.htm), which co-exist with the
"traditional" characters used in Taiwan and overseas Chinese communities. Also, to consider
Chinese as a single language from both the written and spoken language perspective obviously
[...] written Chinese - the Mandarin (or putonghua, or Pekingese) (Wang, 2012; Shen, 2011).



In a previous study, the ranked syllable-to-character mapping spectrum was analyzed without
considering intonation (Li, 2012). Two dictionaries were used with very different numbers of
entries: a small dictionary with 9212 characters, and a larger online dictionary with 21783
characters. Despite the size difference of the two dictionaries, the ranked syllable-to-character
mapping spectra after normalization are virtually identical (Li, 2012). Based on this observation,
we will only use one dictionary for the current analysis that includes intonation.

For a ranked syllable-to-character mapping spectrum, the x-axis is the rank of syllables,
with rank 1 for the most polymorphous syllable (the one with the largest number of characters
pronounced in the same way) and the lowest ranks for the least polymorphous syllables. The y-axis
is the normalized number of characters per syllable. The main result from (Li, 2012) is that when
the x-axis is logarithmically transformed, the fall-off of the spectrum is close to a straight line,
indicating a logarithmic functional form. However, the Beta rank function (Mansilla et al., 2007;
Naumis and Cocho, 2008; Martínez-Mekler et al., 2009) is shown to be even better in
fitting the data (Li, 2012), if only slightly better. In this paper, we will examine whether the
same conclusion still holds when intonation-included syllables are considered.

Different functional forms of the syllable-to-character mapping spectrum provide a numeric
characterization of the fall-off from the polymorphic to the monomorphic (one-to-one) syllables.
Such quantitative characterization should play a similar role to Zipf's law in characterizing
word usage in languages (Zipf, 1935). Not only does such a function make the estimation
of mean, median, and variance easier, but it also helps to prioritize the learning of the language,
and perhaps provides hints on the evolution of the interplay between written and spoken
Chinese.
Figure 1: Histogram of $n_c$ (number of characters per syllable). The x-axis ($n_c$ value) is partitioned in bins
(bin size is 1), and the y-axis is the number of syllables that fall in a bin. Also marked in the plot are: mean
plus/minus one standard deviation of $n_c$; median plus/minus one MAD of $n_c$; the names of the top 15 syllables.

2 Data and Methods 



Modern Chinese Small Dictionary: The 4th edition of XianDai HanYu Xiao CiDian
(translated as Modern Chinese Small Dictionary), published by The Commercial Press in
2006, collected 9212 characters, from 1280 intonation-included syllables. The number of toned
syllables is less than 412 × 4 = 1648 because some intonations do not exist for certain sounds.
No traditional characters are included if the corresponding simplified character is already in
the dictionary. Because some characters can be pronounced in multiple ways, the number of
characters counted once per distinct pronunciation is 9505.

An intonation-included syllable is written in pinyin followed by one of the four numbers
(1, 2, 3, 4) for the four intonations of the level, rising, dipping, and falling tones. For example,
hao1, hao2, hao3, hao4, with the consonant initial h, vowel final ao, plus four intonations. An
alternative notation for the four intonations is hāo, háo, hǎo, hào for the four tones.

Logarithmic function and piecewise linear logarithmic function: Denote $n_c(r)$ the
number of characters per syllable for the rank-$r$ syllable, where rank $r$ ranges from high ($r = 1$) to
low ($r = n$). The $n_c(r)$ can be normalized to $y_r = n_c(r) / \sum_{r=1}^{n} n_c(r)$. Given the data $\{r, y_r\}$
($r = 1, 2, \cdots, n$), a simple fitting function is the logarithmic function:

$$ f_c(r) = C + a \log(r), \tag{1} $$

with one fitting parameter $a$ ($C$ is constrained by the normalization condition $\sum_{r=1}^{n} f_c(r) = 1$).
A piecewise logarithmic function consists of multiple logarithmic fitting functions, one for each
rank range, e.g.:

$$ f_c(r) = \begin{cases} C + a \log(r) & \text{if } 1 \le r \le r_0 \\ C' + a' \log(r) & \text{if } r_0 < r \le n \end{cases} \tag{2} $$

where $r_0$ is the partition rank point separating the two logarithmic functions for high-ranking
and low-ranking ranges. A piecewise function can be continuous or not. To force the continuity
for the above two-piece logarithmic function, one requires $C + a \log(r_0) = C' + a' \log(r_0)$.
Without the continuity requirement, the number of fitting parameters in the two-piece logarithmic
function is 4: $a, C', a', r_0$, and the continuity requirement reduces that number by 1, to 3.
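The two functional forms above can be sketched in a few lines of Python. This is only an illustration, not the code used for the paper: the function names `fit_log` and `two_piece_log` are hypothetical, and the one-piece fit is done by ordinary least squares. For normalized data the residuals of a least-squares line with intercept sum to zero, so the fitted values sum to 1 and the normalization constraint on $C$ is satisfied automatically.

```python
import math

def fit_log(y):
    """Least-squares fit of f(r) = C + a*log(r) to y[0..n-1], taken at ranks 1..n."""
    n = len(y)
    logs = [math.log(r) for r in range(1, n + 1)]
    mean_log = sum(logs) / n
    mean_y = sum(y) / n
    # Slope of the regression of y on log(r).
    a = (sum((l - mean_log) * (v - mean_y) for l, v in zip(logs, y))
         / sum((l - mean_log) ** 2 for l in logs))
    # Intercept; when sum(y) == 1 this also enforces sum of fitted values == 1.
    C = mean_y - a * mean_log
    return C, a

def two_piece_log(r, C, a, C2, a2, r0):
    """Two-piece logarithmic function of Eq.(2); (C2, a2) play the role of (C', a')."""
    if r <= r0:
        return C + a * math.log(r)
    return C2 + a2 * math.log(r)
```

Calling `fit_log` on a list of normalized counts sorted from largest to smallest returns the pair (C, a); `two_piece_log` evaluates a given two-piece fit at a single rank.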



Beta rank function: The Beta function was proposed in (Mansilla et al., 2007; Naumis and
Cocho, 2008; Martínez-Mekler et al., 2009) for fitting ranked plots:

$$ f_c(r) = C \, \frac{(n + 1 - r)^{b}}{r^{a}}, \tag{3} $$

where $a$ and $b$ are two fitting parameters (scaling exponents). The Beta function Eq.(3) is a
modification of the power-law function, which is recovered when $b = 0$. The best known power-law
function for ranked data is Zipf's law of word usage (Zipf, 1935). Note that Eq.(3) shares a
similar form with the Beta distribution $p(x) = C x^{\alpha} (1 - x)^{\beta}$, but it is not a probability density
distribution.
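As an illustration of the Beta rank function, the sketch below evaluates $C (n + 1 - r)^b / r^a$ and computes the value of $C$ that makes the function sum to 1 over all ranks. The helper names `beta_rank` and `normalizing_constant` are hypothetical, and fixing $C$ by normalization is an assumption made here for the example; in a regression $C$ may instead be treated as a free parameter.

```python
def beta_rank(r, n, a, b, C=1.0):
    """Beta rank function C * (n + 1 - r)**b / r**a for rank r in 1..n."""
    return C * (n + 1 - r) ** b / r ** a

def normalizing_constant(n, a, b):
    """Value of C for which the Beta rank function sums to 1 over r = 1..n."""
    return 1.0 / sum(beta_rank(r, n, a, b) for r in range(1, n + 1))
```

Setting `b=0` reduces `beta_rank` to the pure power law `C / r**a`, which is the Zipf-like special case mentioned in the text.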



Nonlinear regression: We use the nls routine in R (http://www.r-project.org/) for the
non-linear least-squares procedure used in regressions (Bates and Watts, 1988). This non-linear
procedure requires an initial value of the parameters, and the best fit is achieved numerically.
Choosing appropriate initial parameter values is important if there are multiple local
optimal solutions. We first use a linear regression on the transformed variables, $y = \log(C) +
a x + b x_2$ (where $y = \log(f)$, $x = -\log(r)$, $x_2 = \log(n + 1 - r)$), to obtain an estimation of the
parameter values. Then these estimations of $a$ and $b$ (and $C$) are used as the initial condition
for the non-linear regression.
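The initialization step can be sketched as follows. This is a minimal pure-Python stand-in for the linear regression on the log-transformed variables, not the R code used in the paper: it solves the 3 × 3 normal equations directly, and the names `solve_linear` and `beta_initial_values` are hypothetical. All $y$ values are assumed to be strictly positive so that the logarithm is defined.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def beta_initial_values(y):
    """Regress log(y_r) on 1, -log(r), log(n + 1 - r); return (C, a, b)."""
    n = len(y)
    rows = [(1.0, -math.log(r), math.log(n + 1 - r)) for r in range(1, n + 1)]
    target = [math.log(v) for v in y]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * t for row, t in zip(rows, target)) for i in range(3)]
    log_C, a, b = solve_linear(XtX, Xty)
    return math.exp(log_C), a, b
```

The returned triple would then be passed as the starting point to a non-linear least-squares routine. The log-transformed fit weights relative rather than absolute errors, which is why it serves only as an initial guess.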

Regression performance: Given the data $\{r, y_r\}$ ($r = 1, 2, \cdots, n$), the fitting performance
of a function $f_c(r)$ is measured by the sum of squared errors (SSE):

$$ \mathrm{SSE} = \sum_{r=1}^{n} \left( f_c(r) - y_r \right)^2. \tag{4} $$

Although SSE increases with the number of points ($n$), for the purpose of comparing two or
several functions on the same dataset, the value of $n$ is the same, and it is not necessary to
normalize (divide by $n$) SSE.

Model selection by AIC: Since functions with more fitting parameters should be able
to fit the data no worse than functions with fewer free parameters, SSE itself
cannot be used to compare two functions with different numbers of parameters. One method
to discount the effect of extra parameters is to penalize the number of parameters. The Akaike
Information Criterion (AIC) (Akaike, 1974) combines minus twice the maximized log-likelihood
with a penalty term that is twice the number of parameters ($K$). In the regression context, it
can be shown that under the condition of unknown variance of the noise, AIC is equal to
(Venables and Ripley, 1999):

$$ \mathrm{AIC} = n \log(\mathrm{SSE}/n) + 2K. \tag{5} $$

Among different fitting functions, the one with the smallest AIC is the best model, either due
to a smaller error SSE or due to a smaller number of parameters $K$.

To compare the AICs of two different fitting functions, we have:

$$ \mathrm{AIC}_2 - \mathrm{AIC}_1 = n \log\left( \frac{\mathrm{SSE}_2}{\mathrm{SSE}_1} \right) + 2 (K_2 - K_1). \tag{6} $$

Suppose the second function fits the data better than the first function but utilizes more
parameters; then the first term in Eq.(6) is negative, but the second term is positive. Only
when the magnitude of the negative term is large enough to compensate the second positive
term is the second function selected.
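The AIC comparison is simple arithmetic, sketched below with hypothetical helper names `aic` and `aic_difference`. The numbers in the example are the SSE values and parameter counts quoted in the Results section for the piecewise logarithmic function (model 2, K = 4) and the Beta rank function (model 1, K = 3); the natural logarithm is assumed.

```python
import math

def aic(n, sse, k):
    """AIC in the regression setting: n*log(SSE/n) + 2K."""
    return n * math.log(sse / n) + 2 * k

def aic_difference(n, sse2, k2, sse1, k1):
    """AIC_2 - AIC_1 = n*log(SSE_2/SSE_1) + 2*(K_2 - K_1)."""
    return n * math.log(sse2 / sse1) + 2 * (k2 - k1)

# Piecewise logarithmic (model 2) versus Beta rank function (model 1).
first_term = 1280 * math.log(2.3554 / 3.9534)
difference = aic_difference(1280, 2.3554e-6, 4, 3.9534e-6, 3)
```

A negative `difference` means that model 2 is preferred. Here the first term is several hundred in magnitude, while the penalty term contributes only 2.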

Simulated data from Poisson distribution: One can simulate a syllable-to-character
mapping spectrum based on the observed one. For a syllable with $n_c$ characters, we can treat
$n_c$ as the mean of a Poisson distribution: Pois($\lambda = n_c$). Note that the standard deviation of
the Poisson distribution is $\sqrt{\lambda} = \sqrt{n_c}$ and the lowest possible value is 0 (as a comparison, if we
use the normal distribution $N(\mu = n_c, \sigma = \sqrt{n_c})$ in simulation, we may have negative count
values). From the observed $\{n_c(r)\}$ ($r = 1, 2, \cdots, n$), a new set $\{n'_c\}$ sampled from the Poisson
distribution is generated. We then remove points with $n'_c = 0$ and re-rank the remaining ones, which
gives one replicate of the simulated dataset. This process is repeated 1000 times.
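One replicate of this simulation might be generated as in the sketch below. It is an assumption-laden illustration rather than the original code: the Poisson sampler uses Knuth's multiplication method (adequate for the modest means involved here), the function names are hypothetical, and the final normalization of the replicate to sum to 1 is assumed by analogy with the normalization of the observed data.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_spectrum(counts, rng):
    """Resample each count from Pois(count), drop zeros, re-rank, and normalize."""
    draws = [sample_poisson(c, rng) for c in counts]
    kept = sorted((d for d in draws if d > 0), reverse=True)
    total = sum(kept)
    return [d / total for d in kept]

rng = random.Random(2012)  # fixed seed only to make the example reproducible
replicate = simulate_spectrum([83, 76, 58, 5, 1, 1], rng)
```

Repeating the call to `simulate_spectrum` 1000 times on the full list of observed counts would give the set of replicates described above. The replicate can be shorter than the input because syllables that draw a zero are removed.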
Figure 2: Ranked syllable-to-character mapping spectrum with n = 1280 syllables and 9505 characters. The
x-axis is the rank of the syllable (r = 1 for the most polymorphic syllable, r = 1079 to 1280 for monomorphic syllables).
The same spectrum is displayed in four different versions: (A) linear-linear; (B) (x)log-linear; (C) linear-(y)log;
(D) log-log.



3 Results 

Simple Statistics: The top-ranking syllables with the largest numbers of characters are yi4
(83), xi1 (76), bi4 (58), yu4 (57), fu2 (52). The next 10 polymorphic syllables (rank r = 6
to 15) are zhi4 (50), ji4 (48), li4 (47), yu2 (45), ji1 (43), qi2 (39), shi4 (39), jue2 (36), ji2
(34), hui4 (34). There are 203 syllables with only one character (one-to-one, monomorphic).
The mean number of characters per toned syllable ($n_c$) is 7.4 (= 9505/1280), the median number
is 5, the standard deviation is 7.6, and the median-absolute-deviation (MAD) is 3. The range mean
± standard deviation (0, 15.185) covers 88.3% of all syllables, whereas the range median ±
MAD (2, 8) covers only 37.1%.

Fig.1 shows the histogram of the number of characters per syllable $n_c$, which summarizes
all the above statements in a graph. Note that the statistical numbers will be dictionary-dependent
as governed by the "large number of rare event" model (Baayen, 2002), but the
shape of the histogram in Fig.1 is expected to remain similar.

Another way to describe the uneven distribution of the number of characters per syllable $n_c$
is by the proportion of characters covered by the top x% of the most polymorphic syllables.
For example, the top 1% of syllables contain 7% of all characters, the top 5% contain 21%, the top
10% contain 33%, and the top 25% contain 59%. The "few polymorphic syllables but a large number
of monomorphic syllables", non-Gaussian-like (Clauset et al., 2009) distribution in Fig.1 also
points to an uneven rank distribution.
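The coverage figures can be checked with a short calculation. The sketch below uses a hypothetical helper `top_share` and assumes that "top x%" means the nearest whole number of syllables. The worked example uses only numbers quoted in the Simple Statistics paragraph: 1% of 1280 syllables rounds to 13, and the 13 largest listed counts sum to 673 out of 9505 characters, which is about 7%.

```python
def top_share(counts, fraction):
    """Share of all characters held by the top `fraction` of syllables."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(fraction * len(ranked)))  # nearest whole number of syllables
    return sum(ranked[:k]) / sum(ranked)

# Counts of the 13 top-ranking syllables (yi4 ... jue2) as listed in the text.
top13 = [83, 76, 58, 57, 52, 50, 48, 47, 45, 43, 39, 39, 36]
share_top_1_percent = sum(top13) / 9505
```

With the full list of 1280 counts, `top_share(counts, 0.05)` and similar calls would reproduce the remaining percentages.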

Ranked distribution of the number of characters per syllable: As with so many
phenomenological laws in quantitative linguistics, a better and smoother description of the data
is obtained by ranking the data points in order (Li et al., 2010). We rank syllables by their $n_c$ values,
from large to small, and the resulting rank plot is in Fig.2 in four different versions: regular
(linear-linear), (x)log-linear, linear-(y)log, and log-log. Fig.2(A) and Fig.2(C) show that the
rank function is neither linear nor exponential. Fig.2(B) indicates that a logarithmic or piecewise
logarithmic function might be a good fitting function. Fig.2(D) hints that a power-law
function cannot fit the data well, but a modified one, such as the Beta rank function, might
fit the data better (Li, 2012).

Related to the uneven distribution of $n_c$, Fig.2 can also be converted to a Lorenz curve (not
shown) for calculating the Gini coefficient. In a Lorenz curve, the x-axis is a cumulation of the
number of syllables (from 0 to 100%) and the y-axis is a cumulation of the number of characters
(also from 0 to 100%). If all syllables have the same number of characters, the Lorenz curve
will overlap with the diagonal line. The Gini coefficient ($G$) is defined as twice the area between
the diagonal line and the Lorenz curve. $G = 0$ means complete equality among syllables, and
$G = 1$ complete inequality. The Gini coefficient for this dataset is $G = 0.49$ by the function
gini in the R reldist package (Handcock and Morris, 1999). This function is an implementation
of the formula $G = n^{-1} \left( n + 1 - 2 \sum_r (n + 1 - r) \, n_c(r) / \sum_r n_c(r) \right)$, where for this
formula the index $r$ runs over the syllables sorted by increasing $n_c$.
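The Gini formula can be written out directly. The sketch below is a plain-Python rendering of the formula, not the reldist implementation itself, and the function name `gini` is reused here only for readability. Values are sorted in increasing order before the weights $n + 1 - i$ are applied, so the largest value receives weight 1.

```python
def gini(values):
    """Gini coefficient G = (n + 1 - 2 * sum_i (n + 1 - i) x_i / sum_i x_i) / n,
    with x_1 <= x_2 <= ... <= x_n."""
    xs = sorted(values)  # increasing order
    n = len(xs)
    total = sum(xs)
    weighted = sum((n + 1 - i) * x for i, x in enumerate(xs, start=1))
    return (n + 1 - 2 * weighted / total) / n
```

For a finite sample the maximum of this estimator is $(n - 1)/n$ rather than exactly 1, which is reached when a single item holds everything.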



[Figure 3 plot. x-axis: syllable rank (logarithmic scale). Curve labels: piecewise log (SSE=2.4e-06), beta function (SSE=4e-06), log function (SSE=3.09e-05).]

Figure 3: Ranked syllable-to-character mapping spectrum with x (rank) in logarithmic scale. Three fitting
functions are superimposed on the data: (1) logarithmic function (thin solid line); (2) Beta rank function (thick
solid line); (3) piecewise logarithmic function with the partition point $r_0 = 15$.

Fitting the ranked distribution of number of characters per syllable - logarithmic
function and piecewise logarithmic function: The best-fit logarithmic function of the
ranked $y_r = n_c(r) / \sum_r n_c(r)$ shown in Fig.3 is $f_c(r) = 5.78 \times 10^{-3} - 8.11 \times 10^{-4} \log(r)$. From
Fig.3, the fitting function clearly under-estimates $f_c$ for top-ranking syllables, which is
compensated by over-estimation of $f_c$ at low-ranking tails (there are many more low-ranking
syllables than high-ranking ones).

Fig.3 strongly suggests that the data can be split into two rank ranges, one for high-ranking
syllables (from r = 1 to r = 15) and another for the rest (r > 15). Interestingly,
the histogram of $n_c$ in Fig.1 does show that the top 15 syllables (yi4, xi1, bi4, yu4, fu2, zhi4, ji4,
li4, yu2, ji1, qi2, shi4, jue2, ji2, hui4) seem to be outliers with respect to the more continuous
distribution for other syllables. For simplicity, we apply the two-piece logarithmic function
without requiring continuity. The logarithmic function fitting the top 15 syllables is $f_c(r) =
0.00877 - 0.00192 \log(r)$, and that for the rest is $f_c(r) = 0.00532 - 0.000739 \log(r)$. Indeed,
the slope of the first function is steeper than the second.

Fitting the ranked distribution of number of characters per syllable - Beta rank
function: The less than perfect fitting of the logarithmic function in Fig.3 calls for the use of
two-parameter functions, as previously advocated by us in (Mansilla et al., 2007; Naumis and
Cocho, 2008; Martínez-Mekler et al., 2009; Li et al., 2010). The Beta function has been proven
to be a robust function that fits well diverse types of ranked linguistic data (Li et al., 2010).
Fig.3 shows the best-fit Beta function by non-linear regression (in thick solid line):

$$ f_c(r) = 5.95 \times 10^{-6} \, \frac{(1281 - r)^{1.025}}{r^{0.324}}. \tag{7} $$

The fact that the fitted value $b = 1.025$ is larger than $a = 0.324$ indicates that, unlike
the situation of Zipf's law, the power-law function is not a good fitting function for this
data. More discussion on the meaning of the relative magnitude of $a$ and $b$ can be found in
(Alvarez-Martinez et al., 2011), and discussion on the range-limited rank variable versus the
range-open rank variable can be found in (Li, 2012).

Comparison of curve fitting performance and model selection: Both the Beta function
and the piecewise logarithmic function seem to fit the data better than the one-piece logarithmic
function. To quantify the fitting performance, we use SSE as the measure of discrepancy and
AIC to compare models. The SSEs of the piecewise logarithmic, Beta, and logarithmic functions are
$2.36 \times 10^{-6}$, $3.95 \times 10^{-6}$ and $3.09 \times 10^{-5}$ respectively, confirming that the piecewise logarithmic
and Beta functions are both better than the logarithmic function, with the piecewise logarithmic
function better still.

The low SSE value for the piecewise logarithmic function is not much affected by the continuity
requirement, provided the low-ranking syllables are fitted first. If the continuity condition is
imposed and the low-ranking syllables are fitted first, SSE is $2.76 \times 10^{-6}$, $2.63 \times 10^{-6}$, or
$2.53 \times 10^{-6}$ if the converging point is at $r_0$ = 15, 15.5, or 16. If the high-ranking syllables
are fitted first, then the SSE values are much worse ($6.80 \times 10^{-6}$, $5.56 \times 10^{-6}$, $4.54 \times 10^{-6}$).

The calculation of AIC is tricky for the piecewise logarithmic function. Even though $r_0 = 15$
is not fitted to the data, it is nevertheless chosen by inspecting Fig.2(B). If we use the AIC of the
Beta function (K = 3) as the baseline value, the AIC of the logarithmic function (K = 2) is larger by
2629, and the AIC of the piecewise logarithmic function (without continuity) (K = 4) is smaller by
659. The first term in Eq.(6), 1280 × log(2.3554/3.9534) = −662, is so large in magnitude that even
if a more severe penalty of model complexity is imposed (e.g. Bayesian information criterion, BIC
(Schwarz, 1976)), the piecewise logarithmic function is still selected.



Figure 4: Histogram of $n \cdot \log(\mathrm{SSE2}/\mathrm{SSE1})$ of 1000 replicates simulated by the observed-data-based Poisson
distribution, where n is the number of syllables with a nonzero count of characters, SSE2 is the sum
of squared errors for the two-piece logarithmic function, and SSE1 is that for the Beta rank function.

Noise and significance: To check the robustness of the model comparison result, we
create replicates of simulated data that are closely related to the observed one, in order to
check how often the same result holds. The proportion of replicates that lead to the opposite
conclusion from the observed data is the empirical p-value. The simulated data are sampled
from the Poisson distribution with the observed number of characters per syllable ($n_c$) as the
mean. It is well known that the mean and the variance of the Poisson distribution are the same,
thus the standard deviation is $\sqrt{n_c}$.

For each replicate of the Poisson-distribution-based syllable-to-character mapping spectrum,
the SSE from the Beta rank function and the SSE from the piecewise logarithmic function are
calculated. For the SSE from the Beta rank function, we again use the nonlinear regression with
the results from linear regression as the initial condition. For the SSE from the piecewise
logarithmic function, we treat the segmentation rank $r_0$ as a running variable, from 2 to 1/5 of
the maximum rank, and the $r_0$ with the best SSE is chosen. The continuity between the two
logarithmic functions is enforced, and the high-ranking syllables are fitted first.

Fig.4 shows the histogram, from the simulation by Poisson distribution, of the first term
in the AIC difference (Eq.(6)) between the two fitting functions: $n \cdot \log(\mathrm{SSE2}/\mathrm{SSE1})$, where
SSE2 is for the two-piece logarithmic function, SSE1 for the Beta rank function, and $n$ the
number of syllables with a nonzero number of characters (due to random sampling, a Pois($\lambda = 1$)
distribution has a high chance to sample a zero value). The majority of these values are
negative, and 15% are positive. Since $K_2 - K_1$ is only 1, it can be shown from Fig.4 that at
most 1% of the replicates have any chance to compensate the negative $n \cdot \log(\mathrm{SSE2}/\mathrm{SSE1})$
term by the larger $2(K_2 - K_1)$ for AIC or the larger $\log(n) \cdot (K_2 - K_1)$ for BIC. This leaves 0.16 as our
rough estimation of the p-value, or the significance in testing the hypothesis that the two-piece
logarithmic function is better than the Beta rank function.
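The empirical p-value described above amounts to counting replicates. The sketch below, with the hypothetical name `empirical_p_value`, takes the list of first terms $n \cdot \log(\mathrm{SSE2}/\mathrm{SSE1})$, one per replicate, adds the complexity penalty, and reports the fraction of replicates in which the two-piece logarithmic function does not come out ahead. Treating a tie (a sum of exactly zero) as "not ahead" is an assumption of this example.

```python
def empirical_p_value(first_terms, penalty):
    """Fraction of replicates with n*log(SSE2/SSE1) + penalty >= 0,
    i.e. replicates in which model 2 is not selected over model 1."""
    opposite = sum(1 for t in first_terms if t + penalty >= 0)
    return opposite / len(first_terms)

# Toy input: four replicates and the AIC penalty 2*(K2 - K1) = 2.
toy_terms = [-10.0, -1.0, 3.0, -50.0]
toy_p = empirical_p_value(toy_terms, 2)
```

In the toy input the second replicate has a negative first term that is outweighed by the penalty, which is the situation the "at most 1%" in the text refers to.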



4 Discussions



Both spoken and written languages change with time. It is pointed out in (Wang, 1973) that
spoken Chinese evolves at a faster speed than written Chinese. Thus a character may be
pronounced in a different way in ancient Chinese from that in modern Chinese. The drifting
of pronunciation may introduce a flux, both in and out, in $n_c$ values. On the other hand,
ancient characters are often out of favor in modern Chinese. This however may not cause a
problem because the rarely used characters are still in the dictionary. How the
syllable-to-character mapping spectrum changes with time is an interesting question to which we
do not yet know the answer.



The conclusion reached in (Li, 2012), that the Beta rank function fits the
syllable-to-character mapping spectrum better than the logarithmic function, remains true when
intonation of syllables is considered, despite a 4-fold expansion in the x-axis. However, we have a
new conclusion that the piecewise logarithmic function fits the data even better than the Beta
rank function.



Although this conclusion seems to be solid by a rigorous model selection technique, the
difference of fitting performance between the two is actually very small ($2.36 \times 10^{-6}$ vs.
$3.95 \times 10^{-6}$). The rank range (r = 1 to r = 15) covered by the second logarithmic function is only
1.2% of the total in linear scale and 37.8% in log-scale. These can be used in an argument
against the claim that the outperformance by the piecewise logarithmic function is real. Indeed,
the empirical p-value of 0.15-0.16 obtained by simulation via a Poisson distribution shows that
the statistical evidence is relatively weak, as the standard criterion for rejecting a hypothesis,
with the implication of accepting another hypothesis, is to have a p-value smaller than 0.05.

The relative fitting performance can be sensitive to small changes in the data. We notice
that the high-ranking syllables are outliers in the distribution of $n_c$ in Fig.1. Outliers tend to
be not reproducible, or have larger variability, which implies we may not reproduce the same
15 $n_c$ values in a replicated run. The main intention of our simulation is to randomize the
value of outliers by a Poisson distribution. Examining the top-ranking syllable yi4 in Table
1, for example, shows that only 24 or so characters are commonly used in modern
Chinese; the others are mostly obscure characters. Some are so rare that a Chinese reader may
not encounter them in her lifetime. Using an even smaller dictionary will reduce $n_c$ for all
syllables, of course, but the effect on the top-ranking $n_c$'s is less predictable.

In conclusion, we have shown that it is interesting to study the functional form
of the syllable-to-character spectrum which connects spoken and written Chinese. While
the two-piece logarithmic function apparently outperforms the Beta rank function in our data,
simulation shows that 15%-16% of the time the Beta function may outperform the two-piece
logarithmic function.